{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Further Hypothesis Testing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select this cell and type Ctrl-Enter to execute the code below.\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from scipy import stats\n",
    "import matplotlib.pyplot as plt\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1 - Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this workshop, you will apply a number of commonly encountered parametric and non-parametric tests to answer a variety of research questions about an [example data set](https://www.kaggle.com/deepu1109/star-dataset).\n",
    "\n",
    "We begin with a brief exploration of the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The file `stars.csv` contains a dataset of 240 stars, with five variables for each star:\n",
    "\n",
    "|variable | description|\n",
    "|:---|:---|\n",
    "|temperature | the surface temperature (K)|\n",
    "|luminosity | luminosity relative to sun|\n",
    "|radius | radius relative to sun|\n",
    "|spectral_class | the spectral class of each star (O,B,A,F,G,K or M)|\n",
    "|type | as defined below|\n",
    "\n",
    "\n",
    "The luminosity and radius of each star is calculated relative to that of the Sun:\n",
    "\n",
    "$L_{sun} = 3.83 \\times 10^{26}\\text{W}$\n",
    "\n",
    "$R_{sun} = 6.96 \\times 10^8\\text{m}$\n",
    "\n",
    "\n",
    "The stars are classified into 6 types:\n",
    "\n",
    "code | type\n",
    ":---|:---\n",
    "0 |Brown Dwarf\n",
    "1 |Red Dwarf\n",
    "2 |White Dwarf\n",
    "3 |Main Sequence\n",
    "4 |Supergiant\n",
    "5 |Hypergiant\n",
    "\n",
    "The dataset contains 40 examples of each type.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the data using the `pandas` library:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv(\"stars.csv\")\n",
    "type_key = ['Brown Dwarf', 'Red Dwarf', 'White Dwarf', 'Main Sequence', 'Supergiant','Hypergiant']\n",
    "\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will be using histogram plots from `matplotlib` to visualise distributions of these variables.\n",
    "\n",
    "For example, execute the following to see an overall histogram of luminosity:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample = data.luminosity\n",
    "xlab = 'luminosity'\n",
    "ylab = 'freq'\n",
    "\n",
    "bins = np.linspace(sample.min(), sample.max())\n",
    "ax = plt.axes()\n",
    "ax.set_xlabel(xlab)\n",
    "ax.set_ylabel(ylab)\n",
    "plt.hist(sample, bins, alpha=0.5, color='gray')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or, more usefully, on a log scale:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample = data.luminosity.apply(np.log)\n",
    "xlab = 'log(luminosity)'\n",
    "ylab = 'freq'\n",
    "\n",
    "bins = np.linspace(sample.min(), sample.max())\n",
    "ax = plt.axes()\n",
    "ax.set_xlabel(xlab)\n",
    "ax.set_ylabel(ylab)\n",
    "plt.hist(sample, bins, alpha=0.5, color='gray')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can split the data by another variable to compare different groups of stars. \n",
    "\n",
    "For example, the following shows log(luminosity), grouped by type:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample = data.luminosity.apply(np.log)\n",
    "grouped = sample.groupby(data.type)\n",
    "xlab = 'log(luminosity)'\n",
    "ylab = 'freq'\n",
    "\n",
    "bins = np.linspace(sample.min(), sample.max())\n",
    "ax = plt.axes()\n",
    "ax.set_xlabel(xlab)\n",
    "ax.set_ylabel(ylab)\n",
    "grouped.apply(lambda x: plt.hist(x, bins, alpha=0.5, color = 'C' + str(x.name), label=type_key[x.name]))\n",
    "ax.legend(loc='upper center')\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
